Abstract

Background: Next generation ultra-sequencing technologies are starting to produce extensive quantities of data 
from entire human genome or exome sequences, and therefore new software is needed to present and analyse 
this vast amount of information. The 1000 Genomes project has recently released raw data for 629 complete 
genomes representing several human populations through their Phase I interim analysis and, although there are 
certain public tools available that allow exploration of these genomes, to date there is no tool that permits 
comprehensive population analysis of the variation catalogued by such data. 

Description: We have developed a genetic variant site explorer able to retrieve data for Single Nucleotide 
Variations (SNVs), population by population, from entire genomes without compromising future scalability and
agility. ENGINES (ENtire Genome INterface for Exploring SNVs) uses data from the 1000 Genomes Phase I to 
demonstrate its capacity to handle large amounts of genetic variation (>7.3 billion genotypes and 28 million SNVs), 
as well as deriving summary statistics of interest for medical and population genetics applications. The whole 
dataset is pre-processed and summarized into a data mart accessible through a web interface. The query system 
allows the combination and comparison of each available population sample, while searching by rs-number list, 
chromosome region, or genes of interest. Frequency and F_ST filters are available to further refine queries, while
results can be visually compared with other large-scale Single Nucleotide Polymorphism (SNP) repositories such as 
HapMap or Perlegen. 

Conclusions: ENGINES is capable of accessing large-scale variation data repositories in a fast and comprehensive 
manner. It allows quick browsing of whole genome variation, while providing statistical information for each 
variant site such as allele frequency, heterozygosity or F_ST values for genetic differentiation. Access to the data mart
generating scripts and to the web interface is granted from http://spsmart.cesga.es/engines.php 



Background 

The appearance of large-scale online compilations of 
human variation has profoundly changed the population 
genetics field in the last decade. Private companies such 
as Perlegen Sciences [1], global collaborations such as 
HapMap [2] and high density Single Nucleotide Poly- 
morphism (SNP) genotyping of the CEPH human gen- 
ome diversity panel by groups from the Universities of 
Stanford [3] and Michigan, have provided extensive var- 
iation catalogues for geneticists to examine differences 
amongst a wide range of human populations. But 
although most genome studies have released their raw 
data to the public there has been a lack of web interfaces
that allow population genetics based interpretation
of the data. Indeed, in the current era of rapidly expand- 
ing numbers of publicly released complete human 
sequences there is an evident need to develop online 
data browsers that can collate and represent portions of 
the data relevant for particular fields of research. 

The 1000 Genomes project (http://www.1000genomes.org/) is a public initiative that aims to collect a very
large proportion of variation detectable by next genera- 
tion sequencing techniques of human genomes from 
several worldwide populations. The first pilot study 
(Pilot 1) assessed the strategy of sharing data across 
samples on whole genome sequencing results with rela- 
tively low coverage (2-4x). It presented 179 genomes 
from the four different population panels previously 
characterised by HapMap (CEU, CHB, JPT and YRI),
describing ~14 million variants. The recent release of an
interim analysis of the project's Phase I has considerably
enriched the data available: 629 entire genomes from 12
different populations, describing ~28 million variants.
These populations are: individuals of African ancestry in 
Southwest USA (ASW), Utah residents with N & W 
European ancestry from the CEPH collection (CEU), 
Han Chinese in Beijing, China (CHB), Han Chinese 
South (CHS), Finnish in Finland (FIN), British in Eng- 
land and Scotland (GBR), Japanese in Tokyo, Japan 
(JPT), Luhya in Webuye, Kenya (LWK), individuals of 
Mexican ancestry in Los Angeles, California (MXL), 
Puerto Ricans in Puerto Rico (PUR), Tuscans in Italy 
(TSI), and Yoruba in Ibadan, Nigeria (YRI). 

Although the 1000 Genomes project has already 
started to release results there are few publicly available 
bioinformatics tools that allow thorough exploration of 
such data. The Integrative Genomics Viewer (http://www.broadinstitute.org/igv/home)
is a Java-based desktop
application that permits visual browsing of the 1000
Genomes Pilot 1, 2, and 3 calls (among other tracks).
Alternatively, the 1000 Genomes Browser (http://browser.1000genomes.org/)
is a web tool that permits visualization
of the variant sites against the reference
sequence, and dynamic loading of tracks of interest 
(functional consequence, conservation, etc.). The latter 
provides a very simple and intuitive way to browse the 
1000 Genomes results, but it does not provide basic var- 
iation statistics for population studies such as allele fre- 
quency or genetic differentiation of the genomes 
included in the project. More importantly, the 1000 
Genomes Browser reviews the sequence surrounding 
just a single query at a time whether variant site, gene 
or chromosome segment. Furthermore, the 1000 Gen- 
omes browser is currently confined to the six Pilot 2 
sequences. 

Construction and content 

We have developed a human genome variant site browser,
ENGINES, dedicated, in the first instance, to the
flexible and thorough analysis of the Single Nucleotide 
Variation (SNV) catalogue generated from the 1000 
Genomes Phase I interim analysis, although it will sub- 
sequently integrate new whole genome sequence data 
from other sources as this becomes publicly available. 

Design and capabilities 

As shown in Table 1, the volume of data is already very
large and, with the goal of aggregating all available new
whole genome data, summarizing approaches are essential
to allow easy data management and to perform quick
non-batched queries [4]. The whole dataset is pre-processed
using a pipeline of customized Perl scripts and
collated into a seven gigabyte MySQL data mart containing
only the summarized statistics, arranged by population,
including allele frequencies, heterozygosity and minor allele
frequency (MAF). This data mart is then queried through
a PHP web interface with the main aim of permitting multiple
SNV queries of entire genomes in a single step,
dictated by a user-defined nucleotide range, HGNC gene
symbol list or rs-number list applied to the user's selection
from different global population panels (Figure 1).

Table 1 legend. The comparison of all the variability information present in the 1000
Genomes Phase I with HapMap release 28 indicates that, although HapMap
doubles the sample size, 1000 Genomes triples the number of genotypes due
to the superior density of variants (this is particularly interesting in the YRI
population, which is now even more completely described than before). The
number of non-monomorphic sites is reported as "variant sites".
Variant sites refer to the number of bi-allelic markers observed in each
population group. Note that Phase I does not contain information on tri- or
tetra-allelic variants, while in Pilot 1 there are more than 16,000 tri-allelic
SNVs plus 12 tetra-allelic SNVs (data not shown).
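The pre-processing step described above reduces raw genotypes to per-population summary statistics. The following is a minimal sketch of that idea for a single variant site, not the ENGINES pipeline itself (which is written in Perl). The function name, the input layout and the uncorrected expected heterozygosity are assumptions made for illustration.

```python
from collections import Counter


def summarize_site(genotypes):
    """Summarize one variant site within one population sample.

    `genotypes` is a list of diploid calls, each a 2-tuple of allele
    strings such as ("A", "G"); a missing call is given as None.
    Returns None when no sample has a call. (Hypothetical helper.)
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        return None

    # Allele frequencies from the 2N chromosomes that have a call.
    counts = Counter(allele for g in called for allele in g)
    freqs = {allele: c / (2 * n) for allele, c in counts.items()}

    # Minor allele: the rarer of the observed alleles (ties broken
    # alphabetically); a monomorphic site has no minor allele and MAF 0.
    if len(freqs) > 1:
        minor = min(sorted(freqs), key=freqs.get)
        maf = freqs[minor]
    else:
        minor, maf = None, 0.0

    # Observed heterozygosity: share of called samples with two alleles.
    h_obs = sum(1 for a, b in called if a != b) / n
    # Expected heterozygosity under Hardy-Weinberg proportions, without
    # a small-sample correction (an assumption of this sketch).
    h_exp = 1.0 - sum(p * p for p in freqs.values())

    return {"n": n, "freqs": freqs, "minor_allele": minor,
            "maf": maf, "h_obs": h_obs, "h_exp": h_exp}


# Illustrative calls only, not real 1000 Genomes data.
calls = [("A", "A"), ("A", "G"), ("A", "A"), ("G", "G"), None]
row = summarize_site(calls)
print(row["minor_allele"], row["maf"], row["h_obs"], row["h_exp"])
```

Storing one such row per site and population is what lets a query read pre-computed values instead of re-scanning genotypes.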

Figure 1 Data workflow. Pre-processing of large-scale human variation sources, creation of a data mart from population and variation specific
data, plus display of results through the web interface. The information taken from dbSNP is used just for mapping purposes; full content is not
present in the data mart. HapMap release 28 describes 4,166,638 SNPs, all listed by dbSNP build 132; 3,654,377 of these are present in 1000
Genomes Phase I. A total of 28,210,483 unique variants have been detected by the 1000 Genomes Phase I interim analysis, 16,313,540 already
listed in dbSNP build 132 (which currently comprises 29,133,600 SNPs in total). Screenshots show a single SNP search for rs4988235; this SNP is
located in the MCM6 gene but influences the lactase gene (LCT); the intercontinental global F_ST value is higher than expected (highlighted in
red; 0.320) as it corresponds to the locus that shows the strongest signal of positive selection in the human genome.

The statistics tab displays a table describing each variation
result in columns: variation code, chromosome,
chromosome position, gene, reference allele (from the
current human reference genome GRCh37), ancestral
allele (from the Chimpanzee genome), alleles found in
all present genotypes, populations queried, number of
samples (N), the minor allele (MA) and its frequency
(MAF), observed and expected heterozygosities (H_OBS
and H_EXP), local inbreeding (F_IS), genetic differentiation
(F_ST, which is shown in different colours according to
the range it falls in: under 0.05, under 0.15, under 0.25 and above 0.25)
and informativeness of population group assignment
(I_n). In ENGINES the emphasis is on multiple queries as
a flexible alternative, broader in terms of the genome portions that can be
queried, to the single marker queries
offered by e.g. the 1000 Genomes browser.
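Several of the columns listed above can be derived from per-population allele frequencies and heterozygosities. The sketch below is illustrative only. It assumes F_IS is computed as 1 - H_OBS/H_EXP and that I_n follows the commonly used definition of Rosenberg and colleagues (2003); the section does not state the exact formulas ENGINES uses. Function names and the handling of values that fall exactly on a threshold are also assumptions.

```python
import math


def local_inbreeding(h_obs, h_exp):
    """F_IS = 1 - H_OBS / H_EXP; None for a monomorphic site."""
    if h_exp == 0:
        return None
    return 1.0 - h_obs / h_exp


def informativeness(freqs_by_population):
    """I_n for one site (assumed Rosenberg-style definition).

    `freqs_by_population` is a list with one dict per population,
    mapping each allele to its frequency in that population.
    """
    k = len(freqs_by_population)
    alleles = set().union(*freqs_by_population)
    total = 0.0
    for allele in alleles:
        p_ij = [f.get(allele, 0.0) for f in freqs_by_population]
        p_j = sum(p_ij) / k  # unweighted mean across populations
        if p_j > 0:
            total -= p_j * math.log(p_j)
        # 0 * log(0) is taken as 0, so zero frequencies are skipped.
        total += sum(p * math.log(p) for p in p_ij if p > 0) / k
    return total


def fst_band(fst):
    """Index (0 to 3) of the display range an F_ST value falls in.

    Ranges follow the thresholds named in the text (0.05, 0.15, 0.25);
    a value exactly on a threshold is placed in the higher range,
    which is a choice made for this sketch.
    """
    thresholds = (0.05, 0.15, 0.25)
    return sum(fst >= t for t in thresholds)


# Two populations fixed for different alleles give the maximum I_n
# for a bi-allelic site; identical frequencies give zero.
print(informativeness([{"A": 1.0}, {"G": 1.0}]))
print(fst_band(0.320))
```

The value 0.320 reported for rs4988235 in Figure 1 falls in the highest range under these thresholds, consistent with it being highlighted in the screenshot.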

Rapid responses to queries of dense genomic data have
been engineered into the browser by use of pre-calculated
SNV allele frequencies based on population groupings, an
approach already successfully implemented in the
population-based SNP frequency browser SPSmart [5].
ENGINES therefore exploits one of the major assets of the 
1000 Genomes Pilot 1 data, the improved detection and 
characterization of low frequency nucleotide variation, 
whether defined by population, genome position or overall 
MAF with linked references made at the same time to 
existing data in dbSNP or HapMap. For example, 
ENGINES can be used to search in batch mode for: 

1) SNVs in specific genes or gene families; 

2) SNVs at varying frequencies in different global 
population panels; 

3) Novel variants or SNVs at very low MAF, which
are now adequately catalogued and validated.

For any selected SNV set, ENGINES can also calculate a
range of statistical indices of interest for human
population genetics studies.
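
The three batch search types listed above amount to filtering pre-computed summary rows by gene, population, frequency and novelty. A minimal sketch follows, assuming a hypothetical row layout (`SiteSummary`) and a hypothetical `filter_sites` helper; the real data mart is a MySQL schema queried through PHP, and its table structure is not described here. All identifiers and values in the example are placeholders, not real data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SiteSummary:
    """One pre-computed row: a variant site in one population group."""
    variant_id: str         # rs-number, or an internal code if novel
    in_dbsnp: bool          # False for variants not listed in dbSNP
    gene: Optional[str]     # gene symbol, None outside known genes
    population: str
    maf: float
    fst: Optional[float]    # None when not computed for this grouping


def filter_sites(rows, genes=None, populations=None, min_maf=None,
                 max_maf=None, min_fst=None, novel_only=False):
    """Return the rows that pass every filter that was supplied."""
    kept = []
    for row in rows:
        if genes is not None and row.gene not in genes:
            continue
        if populations is not None and row.population not in populations:
            continue
        if min_maf is not None and row.maf < min_maf:
            continue
        if max_maf is not None and row.maf > max_maf:
            continue
        if min_fst is not None and (row.fst is None or row.fst < min_fst):
            continue
        if novel_only and row.in_dbsnp:
            continue
        kept.append(row)
    return kept


# Placeholder rows for illustration only.
rows = [
    SiteSummary("rs_placeholder_1", True, "GENE_A", "EUROPE", 0.30, 0.32),
    SiteSummary("novel_placeholder_2", False, "GENE_A", "AFRICA", 0.01, 0.02),
    SiteSummary("rs_placeholder_3", True, "GENE_B", "EUROPE", 0.004, None),
]

# 1) SNVs in specific genes; 3) novel variants at very low MAF.
in_gene_a = filter_sites(rows, genes={"GENE_A"})
rare_novel = filter_sites(rows, max_maf=0.01, novel_only=True)
print(len(in_gene_a), [r.variant_id for r in rare_novel])
```
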

Maintaining the data mart

The update frequency of the databases currently 
accessed by ENGINES varies considerably. Thus, while 
dbSNP is expected to release updates on a yearly basis, 
having been updated once or twice a year since 2004, 
Phase I is a static resource, and the project's final data 
releasing policy has not been publicly stated. The data 
mart will be updated with the 1000 Genomes final variant
data upon release; in addition, relevant whole genome
sequencing data in the public domain from other
initiatives will also be collated and included.

Originally, ENGINES used the 1000 Genomes Pilot 1 as
an appropriate testing dataset. It was mapped to the old
NCBI36/hg18 human genome reference, and for that
reason we were forced to use dbSNP build 130 as the
most up to date standard for describing all variants
when possible. When the 1000 Genomes project
released the Phase I interim analysis we decided to
update our tool to this more appropriate testing dataset,
which implied adapting the data parsing scripts and upgrading
the mapping reference to the new GRCh37/hg19.
This latter change allowed ENGINES to update the variant
description reference to dbSNP build 132. Considering
that human reference versions tend to be fixed for a
long time, this should allow the internal data marts to be
easily updated when new data is released, whether in terms of
the number of genotypes (new projects or updates to existing
projects) or in terms of variant descriptions
(dbSNP updates, which occur approximately
once a year).

The most common population genetics statistical
indices have been implemented and summarized in the
ENGINES data mart, but other metrics of interest could
be easily implemented: only the raw data pre-processing
script would require updating, which is equivalent to two
computing days thanks to the flexibility of the pipeline
developed. In fact, although it took 1 month to adapt
ENGINES to the new 1000 Genomes Phase I
interim analysis data release policy, updating the data
mart with the whole project's final data would take only
1 week, even considering that the number of genomes is
expected to be multiplied by 5.

Utility and discussion 

Since several alternative means are available for 
researchers to access 1000 Genomes SNP data, it is
important to outline the advantages offered by the 
ENGINES browser in comparison to other approaches, 
which we see as complementary in their output, rather 
than competing to provide the same type of data. 
ENGINES is primarily designed to serve population 
genetics studies and therefore has several key features 
built in: 



1. A straightforward system to download the individual
genotypes for the SNPs, genes and populations
queried. This permits direct input into population
analysis algorithms such as Structure [6] or Arlequin [7].

2. Each database, population and SNV can be 
visually compared side by side, and the relevant data 
for SNVs and populations can be downloaded in one 
session from each database query. 

3. F_ST values, amongst other metrics, can be collated
for the entire genome-wide or exome SNV
catalogue.

4. Lists of SNPs or genes are easily handled offering 
a more rapid and straightforward system than the 
SNP by SNP queries of the 1000 Genomes browser. 

5. Genotyping coverage can be assessed at a glance 
by reviewing which SNPs and databases show 
incomplete genotyping. 

6. Different filters are available that allow the selective
listing of sets of variants according to different
thresholds defined by the user (e.g. F_ST, MAF, etc.).

ENGINES processed more than 7.3 billion genotypes
and ~28 million unique variants in the Phase I interim
analysis of the 1000 Genomes project (Table 1), of 
which 11.9 million were not previously described in 
dbSNP 132 (Figure 1). To illustrate the ease with which 
the ENGINES browser can add extra data to existing 
genome-wide analyses, of relevance for population 
genetics studies, we collated the total variant number by 
population group (Table 1). As expected from the 
demographic history of human populations, ENGINES 
clearly indicates the two sub-Saharan samples (LWK 
and YRI) contain more variants than any other popula- 
tion or set of populations, followed by the African- 
American sample (ASW). The data in this population 
break-down is different to the one provided by the 1000 
Genomes analysis [8] because the latter targeted low 
coverage analysis of only the CEU, YRI, CHB, and JPT 
(Pilot 1) or exon regions (Pilot 3). Our data reveals 
interesting differences of SNP density that could contri- 
bute to the study of global patterns of natural selection 
(Table 1). 

F_ST is a metric of genetic differentiation [9] between
populations. It is also well known that the action of natural
selection can locally cause systematic deviation in
F_ST values for a selected gene and nearby markers.
Thus, when compared with a neutrally evolving
gene, high F_ST values might signal the action of
local directional selection, while a decrease of F_ST values
would be suggestive of balancing selection. Analysis of
F_ST values on a genome-wide scale has already been
demonstrated to be very useful for mapping genes
under selection [10]. The 1000 Genomes pilot project
has allowed the calculation of F_ST values for the first
time in the framework of a whole genome sequencing
project [8], and has already revealed preliminary features
relating to new regions that could have been subject to
natural selection. In a step forward, ENGINES provides
F_ST values for different population or continental combinations
selected by the user and centred on the most
current data release of 1000 Genomes. Access to this
information is straightforward, and genotypes can be
easily downloaded ad hoc for the regions of interest in
order to carry out further analyses. By way of example,
additional file 1 provides a snapshot of genome-wide
F_ST values when considering a four-way inter-continental
comparison (Africa, Europe, Asia, and America).
Additional file 2 records the top F_ST values (>0.9)
plotted in Figure S1, indicating that a large proportion
of these values fall within known genes but, notably, a
significant proportion are also located in uncharacterized
genomic regions, thereby providing new targets
of considerable interest for further evolutionary and
population genetic research. In addition, extending the analysis
to a finer intra-continental scale
refines the ability to search, at greater
population depth, for signals of localized adaptation.
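
To make the F_ST discussion concrete, the sketch below computes F_ST for a single bi-allelic site from per-population allele frequencies, using the heterozygosity-based form F_ST = (H_T - H_S) / H_T. This is one common estimator, chosen here as an assumption; the section does not specify which estimator ENGINES implements. The function name and the optional sample-size weighting are likewise illustrative.

```python
def fst_from_frequencies(freqs, sizes=None):
    """F_ST for one bi-allelic site from per-population frequencies.

    `freqs` holds the frequency of the same allele in each population;
    `sizes` optionally holds sample sizes used as weights (equal
    weights when omitted). Returns None when the site is monomorphic
    across all populations, where F_ST is undefined.
    """
    if sizes is None:
        sizes = [1.0] * len(freqs)
    if not freqs or len(freqs) != len(sizes):
        raise ValueError("freqs and sizes must be non-empty and equal length")

    total = float(sum(sizes))
    # Mean allele frequency across populations.
    p_bar = sum(w * p for w, p in zip(sizes, freqs)) / total
    # H_T: expected heterozygosity of the pooled population.
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    if h_t == 0:
        return None
    # H_S: mean expected heterozygosity within populations.
    h_s = sum(w * 2.0 * p * (1.0 - p) for w, p in zip(sizes, freqs)) / total
    return (h_t - h_s) / h_t


# Populations fixed for different alleles are fully differentiated,
# while identical frequencies show no differentiation.
print(fst_from_frequencies([0.0, 1.0]))
print(fst_from_frequencies([0.3, 0.3]))
print(fst_from_frequencies([0.2, 0.8]))
```

Under this form, a locus under local directional selection would tend to show frequencies pulled apart between populations and hence a higher value than neutral sites nearby.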

Finally, an indirect assessment of the quality of 
ENGINES can be undertaken by the user by comparing 
SNP frequencies in Phase I with those of HapMap for 
the overlapping SNPs and populations (CEU, CHB, JPT, 
and YRI). Minor differences or discrepancies are possi- 
ble but can be attributed to missing data or potential 
genotyping errors (due e.g. to Phase I SNV detection 
based on ultra-sequencing at low coverage). We have 
indeed observed genotyping discrepancies between gen- 
otypes reported in HapMap and those reported in Phase 
I for the same samples (data not shown). 
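
The comparison described above can be sketched as a simple check over the SNPs and populations shared by two sources. This is an illustration only: the dictionary layout, the function name and the 0.05 tolerance are assumptions, and both inputs are assumed to report the frequency of the same allele on the same strand. The identifiers and frequencies in the example are placeholders, not real HapMap or Phase I values.

```python
def frequency_discrepancies(freq_a, freq_b, tolerance=0.05):
    """List shared (variant, population) keys whose frequencies differ.

    `freq_a` and `freq_b` map (variant_id, population) to the frequency
    of the same allele in each source. Keys present in only one source
    are ignored. Returns tuples of (key, freq_a, freq_b, difference).
    """
    flagged = []
    for key in sorted(set(freq_a) & set(freq_b)):
        diff = abs(freq_a[key] - freq_b[key])
        if diff > tolerance:
            flagged.append((key, freq_a[key], freq_b[key], diff))
    return flagged


# Placeholder values for illustration only.
phase1 = {
    ("rs_placeholder_1", "CEU"): 0.49,
    ("rs_placeholder_2", "YRI"): 0.03,
    ("rs_placeholder_3", "CHB"): 0.10,
}
hapmap = {
    ("rs_placeholder_1", "CEU"): 0.50,
    ("rs_placeholder_2", "YRI"): 0.12,
    ("rs_placeholder_4", "JPT"): 0.20,
}
for key, a, b, diff in frequency_discrepancies(phase1, hapmap):
    print(key, a, b, round(diff, 3))
```

Flagged entries are candidates for closer inspection rather than proof of error, since missing data and sample differences can also shift frequencies.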

Conclusions 

ENGINES is capable of accessing large variation data 
repositories in a fast and comprehensive manner. We 
have shown that 1000 Genomes variant data, which 
represents the largest current whole human genome var- 
iation repository, is easily summarized and queried by 
ENGINES with a straightforward yet thorough approach 
for handling multiple sites across multiple genomes. 
ENGINES allows fast and easy browsing of whole genome
variation by using a simple and intuitive web interface
that performs queries in seconds and displays
results in an efficient manner, while providing statistical
information for each variant site such as frequency,
heterozygosity or genetic differentiation among populations,
all of which are pre-calculated and presented on
demand.



Availability 

The data mart generating scripts are a set of Perl files
that are freely available in the software section of
ENGINES. Access to these scripts and to the main web
interface is granted from http://spsmart.cesga.es/engines.php

Additional material 



Additional file 1: Figure S1 - Genome-wide F_ST values. Chromosome
position in Mb is given on the X-axis, and F_ST values are plotted on the Y-axis.
F_ST values are shown in black or red (red shows values that are
exceptionally high, corresponding to the upper 2.5% of the empirical
distribution of F_ST values). The yellow line shows the average of F_ST
values for non-overlapping genomic windows of 1 Mb. Gaps correspond
to heterochromatic staining regions near centromeres.

Additional file 2: Table S1 - Top F_ST values. List of SNVs showing the
top F_ST values (above 0.9) for the four main continental groups and their
pairwise combinations (AFR = Africa; EAS = East Asia; EUR = Europe, and
AME = America). Genes and rs-numbers are provided when available.